Abstract

An amyloidogenic region (AR) in a protein sequence plays a significant role in protein aggregation and amyloid formation.
We have investigated the sequence complexity of ARs present in intrinsically disordered human proteins. More than
80% of human proteins in the disordered protein databases (DisProt+IDEAL) contained one or more ARs. With decrease of
protein disorder, AR content in the protein sequence was decreased. A probability density distribution analysis and discrete
analysis of AR sequences showed that ~8% of the residues in a protein sequence were in ARs and that an AR was on average 8
residues long. The residues in the ARs were high in sequence complexity and seldom overlapped with low complexity
regions (LCRs), which were abundant in disordered proteins. The sequences in the ARs showed mixed conformational
adaptability towards α-helix, β-sheet/strand and coil conformations.
Introduction

The available genome sequences and several computational
methods have revealed the presence of proteins that
remain disordered under physiological conditions and resemble
their own functional states [1-9]. These proteins are known by
different names, such as intrinsically disordered [10], natively
denatured [11], natively unfolded and intrinsically unstructured
proteins [3], [10]. The accepted convention, however, is
intrinsically disordered protein (IDP). IDPs comprise 25-30% of the
eukaryotic proteome, and ~50% of eukaryotic proteins contain
long disordered regions [12]. IDPs lack any well-defined
three-dimensional folded structure in solution; structurally they
remain as an ensemble of interconverting conformations under
physiological conditions [13-15]. The lack of a rigid, stably folded
structure may give IDPs the plasticity to interact
efficiently with different targets, compared with a globular protein
with limited conformational flexibility [16], [17]. These
characteristics possibly allow IDPs to be involved in
different pathological and biochemical functions [5], [6], [13],
[16], [18-20]. The functional domain varies from DNA binding to
cell cycle regulation, membrane transport, different molecular
recognition processes, and other important cellular functions [19],
[21-23].

In addition to the important role of IDPs in cellular activity, the
inherent structural disorder plays an important role in the
formation of protein assembly structures [24]. The structural
disorder and flexibility of IDPs are also linked to the formation of
amyloid aggregates, which are implicated in several human disorders
such as Parkinson's disease, Alzheimer's disease, type II diabetes
and others [25-30]. The major protein component of the fibrillar
deposits found in Parkinson's disease is a disordered protein,
α-synuclein [25-30]. Alzheimer's disease is directly linked with
production of an ordered fibrillar structure of the peptide Aβ42. Thus
several neurological disorders are linked to the formation of amyloid
fibrils and their deposition in various cellular organs.

However, it is not very clear how normally soluble disordered
proteins/peptides are converted into amyloid fibres that possess
compact β-sheet structure. It has also been observed
in many in vitro experiments that some structured
proteins convert to amyloid fibrils under solution conditions where
the proteins attain partially disordered structure [31], [32].
Experimental studies and many computational analyses showed that
short sequence stretches in proteins may be responsible and act as
nucleating centres for amyloid fibril formation [33-36]. These
regions are often known as amyloidogenic regions (ARs).
When amyloidogenic sequences of six to eight residues are inserted
in the C-terminal hinge loop of RNase A, the enzyme shows
amyloidogenicity and forms amyloid fibres [34-36]. The presence of
such regions in many water-soluble proteins has been suggested by
Dobson [36], [37] and others [38]. According to the 'amyloid stretch
hypothesis' [35], a short amyloid stretch (equivalent to an AR)
triggers the aggregation process under certain solution conditions.
Mutation or reshuffling in these regions leads to a decrease or total
absence of such aggregation [33], [39]. Thus an AR often acts as a
nucleation center and governs protein aggregation that eventually
leads to formation of β-sheet-rich amyloid fiber.

The IDPs are also rich in sequences with biased amino acid
residues in a stretch, often known as low complexity regions
(LCRs). These regions may also play a critical role in protein
stability and the energetics of fibril formation [1], [40-47]. LCRs are
usually of two types: a majority of LCRs are composed of mixed
polar and charged amino acid (aa) residues, and the presence of
such regions enhances protein solubility and mobility in solution.



PLOS ONE | www.plosone.org | March 2014 | Volume 9 | Issue 3 | e89781

Sequence Complexity of Amyloidogenic Regions



The second type of LCR is a repeat of one or two residue types and is
prone to form amyloid fiber. A good example of such a region is a
stretch of Glu (polyGlu) [48]. Thus the presence of LCRs
modulates the solubility and amyloidogenicity of disordered
proteins [45], [49], [50].

The composition, content and distribution of ARs and LCRs in
a protein sequence may therefore have a role in protein
aggregation and amyloidogenicity. However, no major investigation
has been carried out regarding the sequence complexity of ARs
and their spacing among the LCRs that are commonly found in IDP
sequences. In the present investigation, we computationally
detected and analyzed the sequence composition and complexity,
distribution pattern and structural aspects of ARs and LCRs in
proteins that are deposited in the DisProt and IDEAL databases [4],
[50], [51]. About 8% of the residues are found to be in ARs, and the
average length of an AR is 8 residues. Further, we have found that the
sequences in ARs are highly complex and rarely overlap with
LCRs.

Among the many recently developed computational approaches
and algorithms, we have used the Waltz method developed by
Maurer-Stroh et al. [52-56] to predict the ARs. The Waltz
algorithm uses a position-specific scoring matrix (PSSM)
combined with physical properties and structural aspects of protein
residues to identify ARs [40], [41], [57], [58]. The computational tool
SMART is used to predict the sequence complexity parameters.
We have measured the structural propensity of the residues in ARs
with the APSSP2 algorithm, which is freely available on the World Wide
Web [59], [60].

Materials and Methods 

Selection of Intrinsically Disordered Proteins 

DisProt database release 5.6 (http://www.disprot.org/) provides
a set of proteins with different degrees of disorderness [4]. It gives
the name of the protein, accession codes, aa sequence, location of
the disordered region(s), and methods used for structural (disorder)
characterization. DisProt also reports the biological function(s)
of each disordered region. Sequences of each protein were
retrieved in FASTA format. Length, aa composition, residue
characteristics such as total number of positive and negative
residues, and theoretical isoelectric point (pI) were computed using
the ProtParam tool of the ExPASy Proteomics server
(http://us.expasy.org/tools/protparam.html). The total charge of the
proteins was calculated with the 'protein calculator' server
(http://www.scripps.edu/~cdputnam/protcalc.html).

Additional disordered proteins were selected from the IDEAL data
set, which contains experimentally verified IDPs [51]. The
structural disorder of the proteins varied from 0 to 100%.
The proteins with (-1)% disorder were excluded. Structural
disorder was further calculated using the IUPred algorithm, which is
available at http://iupred.enzim.hu [61]. Protein disorderness was
estimated by counting the number of residues in disordered
regions of a protein as predicted by IUPred, dividing by
the length of the protein sequence, and multiplying by 100.
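
The disorderness estimate described above can be sketched in a few lines of Python. This is only an illustration: the function name, the example scores and the 0.5 cutoff used to call a residue disordered are assumptions made here and are not specified in the text.

```python
from typing import Sequence


def percent_disorder(scores: Sequence[float], threshold: float = 0.5) -> float:
    """Percentage of residues whose per-residue disorder score exceeds `threshold`.

    `scores` stands in for per-residue predictor output; the 0.5 cutoff is an
    assumption of this sketch, not a value given in the paper.
    """
    if not scores:
        raise ValueError("score list is empty")
    disordered = sum(1 for s in scores if s > threshold)
    # (disordered residues / sequence length) * 100, as described in the text
    return 100.0 * disordered / len(scores)


# Hypothetical scores for a 10-residue peptide: five values lie above 0.5
example_scores = [0.9, 0.8, 0.7, 0.2, 0.1, 0.6, 0.3, 0.4, 0.95, 0.05]
example_disorder = percent_disorder(example_scores)
print(example_disorder)
```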

Calculating LCR and AR 

Protein sequences obtained from DisProt and IDEAL were used
to calculate both the LCR and AR content. The LCR content of an
individual protein was predicted by the SEG method as implemented
in SMART (simple modular architecture research tool) [40], [62],
a web-based server available at
http://www.bork.embl-heidelberg.de/Modules/sinput.shtml. Default SEG
parameters were used for finding the LCRs. The SEG method detects LCRs
based on the measurement of information content present in the complexity
state vector [40]. The ratio of the total number of aa residues in all the
LCRs of a protein to the protein sequence length was used to
calculate the content of low-complexity regions in a particular
protein. Amyloidogenic regions of the proteins were identified with the
web-based computational tool Waltz [56] (http://waltz.switchlab.org).
The % content of residues in ARs in a protein was measured
by taking the ratio of the number of residues in all the ARs to the
sequence length of the protein.
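
The content calculation can be illustrated with a short Python sketch. The function name is hypothetical, and counting each residue only once when predicted regions overlap is an assumption of this sketch. The example uses the AR and LCR positions listed for DP00069 (116 residues) in Table 3.

```python
from typing import Iterable, Tuple


def region_content(regions: Iterable[Tuple[int, int]], protein_length: int) -> float:
    """Percentage of residues covered by the given regions.

    Regions are (start, end) pairs with 1-based, inclusive positions.
    Residues covered by more than one region are counted once
    (an assumption of this sketch).
    """
    if protein_length <= 0:
        raise ValueError("protein length must be positive")
    covered = set()
    for start, end in regions:
        covered.update(range(start, end + 1))
    return 100.0 * len(covered) / protein_length


# DP00069: one AR (101-116) and two LCRs (3-24 and 97-112), length 116
ar_content = region_content([(101, 116)], 116)
lcr_content = region_content([(3, 24), (97, 112)], 116)
print(round(ar_content), round(lcr_content))
```

Rounded to whole percentages, the two values agree with the AR (%) and LCR (%) entries reported for DP00069 in Table 3.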

Prediction of Secondary Structure 

APSSP2 was used for the secondary structure prediction of each
protein from its aa sequence [59]. The algorithm takes a
sequence of amino acids as query input and predicts the
corresponding secondary structure with a certain confidence level.
Percentages of residues that prefer to be in α-helix, β-strand and
coil conformations were calculated by taking the ratio of total
residues in a particular conformation to the sequence length of the
protein. Structural preferences of the residues in ARs and LCRs
were obtained by selecting the respective sequence regions in the
predicted structure of the protein. The percentage of AR/LCR
residues with a preference for a particular conformation was
measured against the total number of AR/LCR residues in the
protein.
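
A minimal Python sketch of this bookkeeping follows. It assumes the prediction is available as a string with one letter per residue, using H for helix, E for strand and C for coil; that encoding, the function name and the example string are assumptions for illustration and are not taken from the APSSP2 output format.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Tuple


def conformation_percentages(
    predicted: str,
    regions: Optional[Iterable[Tuple[int, int]]] = None,
) -> Dict[str, float]:
    """Percentage of residues predicted as helix (H), strand (E) or coil (C).

    If `regions` is given (1-based, inclusive positions), only residues inside
    those regions are counted, so the result is relative to the total number
    of residues in the selected regions.
    """
    if regions is None:
        selected = predicted
    else:
        selected = "".join(predicted[start - 1:end] for start, end in regions)
    if not selected:
        raise ValueError("no residues selected")
    counts = Counter(selected)
    return {state: 100.0 * counts.get(state, 0) / len(selected) for state in "HEC"}


# Hypothetical 20-residue prediction with one region at positions 11-16
example_prediction = "CCHHHHHHCCEEEEEECCCC"
whole_protein = conformation_percentages(example_prediction)
inside_region = conformation_percentages(example_prediction, [(11, 16)])
print(whole_protein, inside_region)
```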

Statistical Analysis 

All the statistical analysis was performed in Wolfram Mathematica 8.
Mean, standard error of mean (SEM) and standard
deviation (SD) were calculated for AR/LCR length and content.
A stable distribution function (Text S1) with index of stability α,
skewness parameter β, location parameter μ, and scale parameter
σ was fitted to the data to show the distribution pattern of AR/LCR
length and AR/LCR content in a protein. A bivariate
probability distribution, the smoothed kernel density distribution,
was used to show the distribution of AR/LCR content with
protein length. To find the correlation between AR/LCR
content and protein sequence length, negative hyperbolic equations
were fitted to the data.
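
The hyperbolic fit can be sketched in Python, assuming the model has the form y = a + b/x, which is the form of the fitted equations reported in the Figure 4 legend. The analysis in the paper was done in Mathematica; the function below, its name and the synthetic data are illustrative assumptions only. Because the model is linear in 1/x, ordinary least squares on the transformed variable is sufficient.

```python
from typing import Sequence, Tuple


def fit_hyperbola(lengths: Sequence[float], contents: Sequence[float]) -> Tuple[float, float, float]:
    """Least-squares fit of content = a + b / length; returns (a, b, R^2)."""
    if len(lengths) != len(contents) or len(lengths) < 3:
        raise ValueError("need at least three paired observations")
    u = [1.0 / x for x in lengths]  # the model is linear in u = 1/x
    n = len(u)
    mean_u = sum(u) / n
    mean_y = sum(contents) / n
    s_uu = sum((ui - mean_u) ** 2 for ui in u)
    s_uy = sum((ui - mean_u) * (yi - mean_y) for ui, yi in zip(u, contents))
    b = s_uy / s_uu
    a = mean_y - b * mean_u
    ss_res = sum((yi - (a + b * ui)) ** 2 for ui, yi in zip(u, contents))
    ss_tot = sum((yi - mean_y) ** 2 for yi in contents)
    return a, b, 1.0 - ss_res / ss_tot


# Synthetic, noise-free data generated from y = 6 + 650/x (illustrative values)
example_lengths = [50.0, 100.0, 200.0, 400.0, 800.0]
example_contents = [6.0 + 650.0 / x for x in example_lengths]
a_fit, b_fit, r_squared = fit_hyperbola(example_lengths, example_contents)
print(a_fit, b_fit, r_squared)
```

On real data with wide scatter the coefficient of determination from such a fit can be low, in which case the fit does not support a correlation.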

Results 

Content of AR and LCR in Different Classes of IDPs 

The DisProt database analysis revealed 221 human proteins and
432 nonhuman (other than human) proteins with different degrees
of disorderness. Table 1 and Tables S1 and S2 list some of these
proteins with their physicochemical properties. An additional 186
unstructured human proteins and 25 nonhuman proteins were
obtained from the IDEAL database (Tables S3 and S4). Tables S1, S2,
S3, and S4 show the protein name, database ID and the % of
protein disorder measured by IUPred. The tables also show the
content (%) of AR and LCR in a particular group of proteins. The last
two columns in the tables display the number of ARs found within
15 residues of the C- and N-terminus of the protein sequence;
these are marked as the 'C' and 'N' columns, respectively. The
DisProt database provides the content of structural disorder;
however, the disorderness of all the proteins present in the IDEAL and
DisProt databases was calculated using the IUPred server. The
proteins from both databases were arranged in descending
order of disorderness. The content (%) of AR sequences decreased
with increasing structural disorder. However, fewer
LCR sequences were present in proteins with a high
content of structural elements.

Based on the calculated disorderness, the proteins in each 
type (human/nonhuman) of proteins were grouped into three 
Figure 1. Content of AR and LCR sequences in different classes of disordered proteins. (A), DisProt human; (B), IDEAL human; (C), DisProt
nonhuman and (D), IDEAL nonhuman. White bars signify the LCR region, gray bars the AR region and black bars the overlapped
region of AR and LCR. (E and F), Percentage of AR and percentage of LCR sequences in different groups of disordered proteins, respectively. The bottom
axis in all the plots represents the three groups of disordered proteins with different degrees of disorderness: PDP (0-30% disorder), MDP (31-70%
disorder) and LDP (71-100% disorder). In (E) and (F), asterisks indicate a statistically significant difference from the other groups (see Table S5).
doi:10.1371/journal.pone.0089781.g001



categories as suggested in a previous report [63]. Proteins with
71-100% structural disorder were grouped as largely disordered
proteins (LDPs). Moderately disordered proteins (MDPs) possessed
31-70% of their sequence in disordered region(s), and the remaining
proteins, with up to 30% of their sequence in disordered segments, were
grouped as partially disordered proteins (PDPs). Sequence details
of the ARs and LCRs in these groups of proteins are shown in Table 2.
Figure 1 displays a graphical view of the analysis. The number of
LDPs was smaller than that of MDPs and PDPs. The percentage
of amyloidogenic proteins (proteins that contained at least one AR)
was also found to be lower in the LDP group. To gain confidence in
this analysis, a t-test was performed based on the sequence content (%)
in the individual proteins of each group (LDP, MDP and PDP).
The confidence level was obtained from the respective p-values as given
in Table S5.
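
The three-way grouping can be expressed as a small Python helper. The function name is hypothetical, and the handling of fractional values that fall between the stated integer ranges (for example 30.5%) is an assumption of this sketch.

```python
def disorder_class(percent_disorder: float) -> str:
    """Assign a protein to PDP, MDP or LDP from its percentage of disorder.

    Ranges follow the text: PDP 0-30%, MDP 31-70%, LDP 71-100%.
    Assumption: fractional values between two ranges are placed in the
    higher group (e.g. 30.5 -> MDP).
    """
    if not 0.0 <= percent_disorder <= 100.0:
        raise ValueError("percentage of disorder must lie between 0 and 100")
    if percent_disorder <= 30.0:
        return "PDP"  # partially disordered protein
    if percent_disorder <= 70.0:
        return "MDP"  # moderately disordered protein
    return "LDP"      # largely disordered protein


# Hypothetical disorder percentages for three proteins
example_classes = [disorder_class(p) for p in (12.0, 50.0, 85.0)]
print(example_classes)
```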

Table 2 and Tables S1, S2, S3, and S4 show that some of the
proteins in each group contained no AR. For instance, among 221
human proteins in the DisProt database, 191 (~86%) proteins were
amyloidogenic and each contained at least one AR; the remaining 30 human
proteins contained no ARs. The percentage of amyloidogenic
proteins was maximum (93%) for PDPs and
decreased to 70% for the LDPs. A similar trend was observed with
nonhuman proteins as presented in Table 2 and Table S2.
Analysis of protein sequences from the IDEAL database also revealed a
similar trend in the content of amyloidogenic proteins in the different
groups of proteins (Table 2 and Table S3). The percentage of sequence
in low complexity regions (LCRs) in each individual protein in the
DisProt and IDEAL databases is also given in Tables S1, S2, S3,
and S4. A group-wise distribution of the LCRs is presented in
Figure 1 and Table 2. The content of LCR sequence (%) was
maximum in LDPs, and a little more than 20% of the sequence was
found in LCRs in human proteins found in DisProt. The
content of LCR sequences was found to decrease with the decrease
of structural disorder. Nonhuman DisProt proteins contained a
slightly higher percentage (16%) of LCR sequences than the
proteins in the human category. The LCR sequence content in
proteins of the IDEAL database was less than in the DisProt proteins.
The content of LCR was least in PDPs. P-values from the t-test of
some of the above comparisons are given in Table S5.

The sequence length of the AR/LCR and their content varied 
from protein to protein. Table 3 and Table S6 provide the 
sequence detail of the ARs, LCRs and the overlap regions between 
the two regions (AR/LCR). The table provides information 
regarding AR/LCR length and sequence position of the regions 
Table 3. LCRs, ARs (*) and overlap regions (†) in some of the human disordered proteins from DisProt data.

DisProt ID | LCR/AR | Protein length | AR (%) | LCR (%)
DP00016 | GPRRGRDELGGGRRPG (81-96) | 164 | 0 | 10
DP00017 | RLLLAPRPVAVAVAVSPPLEPAAES (101-125) | 316 | 0 | 43
 | PSVPVPAPASTPPPVPVLAPAPAPAPAPVAAPVAAPVAVAVLAPAPAPAPAPAPAPAPVAAPAPAPAPAPAPAPAPAPAPDAAP (137-220) | | |
 | AAGTAAASANGAA (251-263) | | |
 | VPAPCPSPSAAPGVGSV (291-307) | | |
DP00039 | KRKAEGDAKGDKAKVKDE (2-19) | 89 | 0 | 62
 | AKPAPPKPEPKPKKAPAKKGEKVPKGKKGKADAGKEG (29-65) | | |
DP00040 | SESSSKSS (2-9) | 107 | 0 | 66
 | KRGRGRPRKQPP (23-34) | | |
 | PKRPRGRPKGSKNKG (54-68) | | |
 | KTRKTTTTPGRKPRGRPKKLEKEEEEGISQESSEEE (71-106) | | |
DP00069 | ATAATAPPAAPAGEGGPPAPPP (3-24) | 116 | 14 | 33
 | IILGVICAIILIIIIV (97-112) | | |
 | VICAIILIIIIVYFSS (101-116)* | | |
 | VICAIILIIIIV (101-112)† | | |
DP00070 | KAKEGVVAAAEKTK (10-23) | 140 | 4 | 21
 | EGVLYV (35-40)* | | |
 | VTNVGGAVVTGVTAVA (63-78) | | |
DP00126 | SKSKDGTGSDDKKAKGADGKTKIAT (129-153) | 441 | 1 | 17
 | PAKTPPAPKTPPSSGEPPKSGDRSGYSSPGSPGTPGSRSRTPSLPTPPTREP (172-223) | | |
 | KVQIIN (274-279)* | | |
DP00174 | AFELI (19-23)* | 149 | 3 | 0
DP00199 | VLILACLVALA (3-15) | 226 | 0 | 38
 | ETIESLSSSEESITE (17-31) | | |
 | HEDQQQGEDEHQD (41-53) | | |
 | LPLAQPAVVLPVPQP (82-96) | | |
 | LHLPLPLLQPLMQQVPQPIPQ (139-159) | | |
 | LLLNQELLLN (196-205) | | |
DP00214 | SHDHMDDMDDEDDDDHVDSQDSIDSNDSDDVDDTDDSHQSDESHHSDESDE (81-131) | 314 | 0 | 20
 | EFHSHEFHSHE (272-282) | | |
DP00219 | ETVTETTVTVTTE (10-22) | 126 | 0 | 37
 | ESSTESDEEEEE (72-83) | | |
 | PTPTTPPQPPDPSQPPPGPMQ (105-125) | | |
DP00287 | EAEVGAEEAGVEEYGPEEDGGEESGAEESGPEESGPEELGAEEEMEAG (10-57) | 213 | 8 | 23
 | SQVIF (72-76)* | | |
 | IFANITLPVYTL (147-158)* | | |
DP00332 | GSSDSSEENGDDSSEEEEEEEETSNEGENNEESNEDEDSEAENTT (62-106) | 317 | 3 | 41
 | KEKESDEEEEEEEEGNENEESEAEVDENE (145-173) | | |
 | TGANAEGTTETGGQGKGTSKTTTSPNGG (207-234) | | |
 | GKTTTVEYEGEYEYTG (252-267) | | |
 | GQGYDGYDGQNYY (302-314) | | |
 | GQNYYHHQ (310-317)* | | |
 | GQNYY (310-314)† | | |
DP00372 | HQAIIM (7-12)* | 106 | 17 | 0
 | AVGNIF (35-40)* | | |
 | IIFAID (66-71)* | | |
DP00510 | EDEDSSLDESDLYSL (18-32) | 82 | 0 | 31
 | GGGGRKGRTKRE (38-48) | | |
DP00521 | ATLIYV (2-7)* | 202 | 3 | 5
 | PPSPVKMPSPP (163-173) | | |
DP00546 | GAERRCGPGPAPPPPRAEA (16-34) | 175 | 5 | 21
 | RRSREQKAKQEREKELAK (116-133) | | |
 | VEALIALTN (167-175)* | | |
DP00555 | EGVLYV (35-40)* | 134 | 8 | 28
 | GAGNIA (73-78)* | | |
 | EEVAQEAAEEPLIEPLMEPEGESYEDPPQEEYQEYEPE (96-133) | | |
DP00592 | AAVAIQ (42-47)* | 62 | 10 | 0
DP00617 | LLEEDDEFEEF (12-22) | 70 | 0 | 36
 | VWEDNWDDDNVEDD (38-51) | | |
DP00630 | AVSEAVVSSVNTVATKTV (65-82) | 127 | 0 | 30
 | QQEGEASKEKEEVAEEAQSG (106-125) | | |
Aβ42 | KLVFFA (16-21)* | 42 | 29 | 0
 | GGVVIA (37-42)* | | |

Sequence positions are given in the parentheses. Single letter code is used to represent individual aa residues.
doi:10.1371/journal.pone.0089781.t003



and the percentage of AR/LCR sequences in an individual
protein. Individual AR lengths varied from 5 to 34 residues. The
content of AR sequences was between 0 and 44% (Tables S1, S2,
S3, and S4). For example, the shortest protein, the 37-residue
antibacterial LL-37 (DP0004_C002), contained no AR, whereas tau, with
441 amino acids, contained 1.3% AR residues. DP00069, with a
sequence length of 116, was very rich in AR sequences (14%).

In contrast to ARs, most of the LCRs were 8^0 residues long. 
The shortest LCR was 8 residues long. One such region was 
detected in DP00040. The largest LCR of 84 residues long was 
detected in DP00017. LCRs in tau (DP00126), for instance,
occupied 17% of its total sequence. More than 35% of the residues in
β-casein (DP00199) and regulatory subunit 1 (DP00219) were in
LCRs.

Statistical Analysis 

Statistical analysis was carried out to reveal the average AR/LCR
content (%) and the length of the two regions (AR/LCR) in
human proteins. To obtain the statistical parameters, the AR/LCR
content in all the human proteins from the DisProt and IDEAL
databases (Tables S1 and S2) was combined. The total number of
proteins examined was 407 and the combined numbers of ARs and
LCRs were 1765 and 1348, respectively (Table 2).

A stable distribution function (see Materials and Methods and
Text S1) was applied to the experimental data (detected ARs and
LCRs). Figure 2 shows the frequency histogram and the fitted
distribution function for both the LCR and AR. Table 4 reports
the statistical parameter values estimated from the fit to ARs/LCRs.
It was found that the statistical population (% of AR/LCR
sequences) was characterized by a positive (and much larger than
zero) value of the skewness coefficient. The mean value was ~8%
of sequences for the AR. A similar distribution fit was made to the
available lengths of the ARs/LCRs as shown in Figure 3 and the
mean value was about 8 residues for the AR and 34 residues for
the LCR.

Figure 3 shows the smoothed kernel density estimation for the 
LCR/AR content in a protein (left and right panel, respectively). 
The plots have been shown in two different clipping planes. 
Bottom figure shows the smoothed 3D histogram. The smoothed 
kernel density estimation plot shows a distinct peak suggesting 
Figure 2. Probability distribution of LCR and AR lengths and percentages. Distribution of LCR lengths (A) and percentage of LCR (B) in
LCR-containing disordered proteins. (C) and (D), respectively, represent the probability distribution of AR lengths and AR content (%) of IDPs.
Fitted statistical parameters are given in Table 4. Histograms of data are shown with a suitable bin size.
doi:10.1371/journal.pone.0089781.g002



~8% AR content in a ~400 aa long protein and indicated that the
proteins detected in the two databases were mostly ~400 aa long
and contributed largely to the estimate of the average content of the
AR and LCR. No correlation could be observed between the AR/LCR
content and protein length (Figure 4). At a deeper
clipping plane the plot suggested a negative hyperbolic relationship,
i.e. a decrease in the AR/LCR content with increasing protein
length. However, no significant fit could be obtained to validate
this assumption.

Sequence Aspects of AR and LCR 

One interesting observation was that a large number of
proteins contained both ARs and LCRs; however, the two
regions rarely overlapped with each other (Figure 1, Tables S1, S2,
S3, and S4, Table 3 and Table 5). For instance, DisProt human
proteins contained 894 ARs and 638 LCRs; however, only 53
occurrences of sequence overlap between the two regions
were observed, and in most cases the overlap was partial
(Table 5). An LCR with residues 97-112 in DP00069 overlapped
with the C-terminal AR of residues 101-116, and the overlapping
region contained 12 residues. In DP00332, an LCR with
residues 302-314 overlapped with an AR (310-317), and only
five residues (310-314) were found in the overlapping region. Similarly, four
ARs from DP00119, DP00551, DP00643_A002 and DP00683
partially overlapped with LCRs. A similar result was obtained in
the other groups of proteins. Among 1889 ARs in
DisProt nonhuman proteins, only 74 overlapped with
LCRs. On average, ~3% of the AR sequences overlapped with
the LCR sequences. These observations clearly indicated that the



Table 4. Statistical analysis on AR/LCR length/content.

Stable distribution parameters | AR length distribution | AR percentage distribution | LCR length distribution | LCR percentage distribution
Index of stability, α | 1.02 | 1.34 | 0.92 | 1.08
Skewness parameter, β | 0.99 | 0.99 | 0.99 | 0.99
Location parameter, μ | 6.55 | 9.73 | 14.99 | 9.73
Scale parameter, σ | 0.94 | 2.24 | 4.67 | 2.24

Stable distribution function fitting parameters.
doi:10.1371/journal.pone.0089781.t004
Figure 3. Smoothed kernel density estimation for the LCR and AR content in a protein. The left and right panels, respectively, represent the
density for LCR and AR. The plots are shown in two different clipping planes. The bottom figures show the smoothed 3D histogram for the AR and
LCR.
doi:10.1371/journal.pone.0089781.g003



residues in AR were very complex and rarely overlapped with the 
LCR. 
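
Whether and where an AR and an LCR overlap can be checked with simple interval arithmetic. The sketch below is illustrative (the function name is hypothetical) and uses the DP00069 positions quoted in the text: an LCR at residues 97-112 and an AR at residues 101-116.

```python
from typing import Optional, Tuple

Region = Tuple[int, int]  # (start, end), 1-based and inclusive


def overlap(region_a: Region, region_b: Region) -> Optional[Region]:
    """Return the shared positions of two regions, or None if they are disjoint."""
    start = max(region_a[0], region_b[0])
    end = min(region_a[1], region_b[1])
    return (start, end) if start <= end else None


dp00069_lcr = (97, 112)
dp00069_ar = (101, 116)
shared = overlap(dp00069_lcr, dp00069_ar)
shared_length = shared[1] - shared[0] + 1 if shared else 0
print(shared, shared_length)
```

For DP00069 this gives the 12-residue overlap (101-112) described above.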

We also calculated the average content of different types of amino
acid residues in both the AR and LCR. Figure 5 displays the
average content of different types of residues present in the AR,
LCR and total proteins. A major fraction of the AR residues was
hydrophobic, and Leu was the most abundant (12.6%) residue.
Other major residues in the region were Ile (11.2%), Phe (8.8%),
Tyr (8.6%), Val (8.1%) and Ala (7.3%). The ARs were depleted
in Pro, Lys, His and others. A majority of the residues in the
LCRs were hydrophilic in nature, and the regions were enriched
with Ser (13.1%), Pro (12.1%), Gly (9.8%) and Ala (9.2%).

The structural propensities of residues in the ARs were
measured using the APSSP2 algorithm (see Materials and
Methods). The analysis showed that the conformational preference
of the AR residues was not confined to any particular structure;
rather, on average a mixed structural preference of the AR residues
was observed in all three groups of proteins. Figure 6 displays the
overall structural heterogeneity of the AR sequences present in
human (DisProt) proteins. The average fraction of sequence that
preferred α-helical conformation was ~38%. Preferences for
β-sheet/strand and coil conformations were ~31% and ~32%,
respectively. This result indicated that not all of the sequences in the
ARs favoured β-conformation. When compared with the total
protein sequence present in the same group of proteins, about
56% of the residues preferred coil conformation and ~30%
showed structural propensity towards α-helical conformation.
The remaining 14% favoured β-sheet/strand conformations. The number
of residues that preferred the β-sheet component increased
substantially in the ARs; however, a large fraction of the AR residues (38%)
favoured α-helical conformation.

Discussion 

It is known from previous investigations that ARs play a key role
in protein aggregation and amyloid fibril formation. In
this report we detected ARs using the Waltz algorithm and
computationally analyzed the sequence complexity, conformational
preference and distribution of ARs in disordered human
proteins present in the DisProt and IDEAL databases. There are
several methods to detect ARs [56], [64-66]. Some important
algorithms and software to predict aggregation aspects of proteins
are Tango [55], Waltz [56], PASTA [67-70], Aggrescan [71],
SALSA [72], Zyggregator [73], AmylPred [64] and FoldAmyloid
[74]. The ability of protein sequences to form β-strands/sheets
is a predominant feature in most of these algorithms. PASTA was
developed based on the hidden β-propensity of protein sequences
[67-70]. The Aggrescan software was based on an aggregation
propensity scale for the 20 natural amino acids [71]. This method
stressed that short and specific sequence stretches were responsible
for protein aggregation. Based on the average packing density of the
aa residues, FoldAmyloid identified a sequence pattern that could
promote amyloid fibril formation [34]. The Waltz methodology was
used in this investigation because many of its selected regions were
experimentally verified and the method was better able to
differentiate between amyloid fiber formation and amorphous aggregates
[56].

The investigation revealed that more than 80% of disordered
human proteins (DisProt and IDEAL databases) possessed at least
one AR, indicating that a significant number of disordered
proteins were amyloidogenic. Waltz detected ARs in a large
number of proteins in the DisProt and IDEAL databases. The large
data set helped to derive, along with the discrete analysis
(Table 6), the statistical average of AR and LCR sequence percentage
and the average AR and LCR sequence length. The discrete
analysis results for all groups of proteins are given in Table 2 and
Table 6. The average values did not differ much from the statistical
analysis results (Table 4). However, the statistical values may be
more acceptable to represent the average properties and
composition of the LCRs and ARs.

Percentage of amyloidogenic proteins was higher in the PDP 
groups. Thus the content of AR sequences was more in proteins 

Figure 4. Correlations between content of LCR and AR
sequence with the protein length. (A) Correlation between
content of LCR sequence and protein sequence length. No significant
correlation could be obtained for the LCR content in a protein
sequence. The figure shows a negative hyperbolic fit
(y = 9.44056 + 1926.61/x; R², 0.113058) with standard deviation bands
(at 1σ, 2σ, and 3σ). (B) Correlation between content of AR sequence
and protein sequence length. No significant correlation could be obtained for
the AR content in a protein sequence. The figure shows a negative
hyperbolic fit (y = 6.05937 + 651.62/x; R², 0.112173) with standard
deviation bands (at 1σ, 2σ, and 3σ).
doi:10.1371/journal.pone.0089781.g004
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Table 5. Overlapping regions in DisProt human proteins. 




Disprot ID 




LCR/AR overlap region 




DP00069 


LCR 


IILGVICAIILIIIIV (97-112) 






AR 


VICAIILIIIIVYFSS (101-116) 




DP00332 


LCR 


GQGYDGYDGQNYY (302-314) 






AR 


GQNYYHHQ (310-317) 




DP00119 


LCR 


LLILLSVALLALSSAESSSEDVSQEESL (2-29) 






                AR      SLELIS (28-33)
DP00551         LCR     ALLLLLFLHLAFL (10-22)
                AR      --LLLLFLHLAEL (12-22)
DP00643_A002    LCR     VILRLLRYIVRLVWR-- (122-136)
                AR      LLRYIVRLVWRMH (126-138)
DP00683         LCR     LVSVYNSYPYYPYLY- (210-224)
                AR      LVSVYNSYPYYPYLYC (210-225)
DP00012         LCR     FNSSAFEFSGFEVVFLSV (305-322)
                AR      AYVRYFNSSAFFFSGFFVVFLSVLPYALIKGIIL (300-333)
                LCR     IQLLLIVIGAIAWAVLQ (995-1012)
                AR      -QLLLIVIGAIA (996-1006)
                LCR     -IEVIFFIAVTEISI-- (1106-1119)
                AR      MIFVIFFIAVTFISILT (1105-1121)
DP00074         LCR     AAYEFNAAAAANA (58-70)
                AR      AAYEFN (58-63)
                LCR     LTLQQQHQRLAQLLLIL- (495-511)
                AR      QLLLILS (506-512)
DP00099         LCR     TIITPPTPIIP (336-346)
                AR      AGWTIIT (333-339)
DP00162         LCR     TTGVVTVIVILIAIAALGALILG (9-31)
                AR      IVILIAIAALGALILGCWCYL (16-36)
DP00191         LCR     LLLLLFL-- (8-14)
                AR      -LLLLFLKS (9-16)
DP00231         LCR     QTPQGQQGLLQAQNLLTQLPQQ (210-231)
                AR      AQFIISQ (204-210)
DP00272         LCR     LALADALATSTL (112-123)
                AR      ATNIYIFNLA (104-113)
DP00282         LCR     KNNWNIEDNNIKN (1132-1144)
                AR      -NNWNIE (1133-1138)
DP00306         LCR     ITILIIALIAL (51-61)
                AR      NWFITILIIALIALSVGQYN (47-67)
DP00307         LCR     LEQILEYELLLIQQL (140-154)
                AR      ELLLIQQLNFHLIV (147-160)
DP00311         LCR     AVAGLVLVALLAILV (232-246)
                AR      ALLAILVENWH (240-250)
DP00314         LCR     PKLPDDTTFPLPPPRPK (149-165)
                AR      KNVIFE (165-170)
DP00317         LCR     TEKRKKRSTKKE (301-312)
                AR      EVFNILQAAYV (312-322)
DP00324         LCR     GGNFGGRSSGPYGGGG (329-344)
                AR      GGQYF (343-347)
DP00338         LCR     MILFLIMLVLVLF (20-32)
                AR      -ILFLIMLVLVLFGYG (21-35)
DP00339         LCR     MILFLIMLVLVLF (20-32)
                AR      -ILFLIMLVLVLFGYG (21-35)
                LCR     GDFYYLGGFFGG (261-272)
                AR      GDFYYLGGFFG- (261-271)
DP00356         LCR     NNQYFNHHPYPHNHYMP (120-136)
                AR      NNQYFN (120-125)
DP00381         LCR     NNTQTTTHLQPLHHP (819-833)
                AR      ELNNINNTQ (814-822)
DP00406         LCR     LQALYALQALWTL- (1522-1535)
                AR      LQALYALQALWTLE (1522-1536)
DP00428         LCR     LELCRRRSLLEL (130-141)
                AR      NDFVFWLEL (123-132)
DP00448         LCR     LWKTALKLLLVFV (217-230)
                AR      LLLVFVEYS (225-233)
DP00464         LCR     KKLKEKKDELD (45-55)
                AR      LDSLITAITTN (54-64)
DP00466         LCR     SPPVILLISFLIFLIV- (237-252)
                AR      VILLISFLIFLIVG (240-253)
DP00467         LCR     AKPNATTANGNTALAIA (785-801)
                AR      TALAIA (796-801)
DP00503         LCR     LLIILFIIVPIFLLL (167-181)
                AR      KDGIIMIQTLLIILFIIVPIFLL- (158-180)
DP00508         LCR     LAVLILAIILL (7-17)
                AR      LAVLILAIILLQGTLAQ (7-23)
DP00519         LCR     SSGAKSPSKSGA (1355-1366)
                AR      KAVEFSS (1350-1356)
                LCR     LEELEKERSLLLADLDKEEKEKD (134-156)
                AR      KDWYYAQLQNLTK (155-167)
DP00520         LCR     KSPKGSGKPPGVPASSKSGK (332-351)
                AR      KAFSYYL (351-357)
DP00553         LCR     ASLLFLNVLAFAAL- (716-729)
                AR      ASLLFLNVLAFAALY (716-730)
DP00574         LCR     GPGRLEREAAAAAATTPAPTAGAL (52-75)
                AR      AGALYSG (72-78)
                LCR     SGSEGDSESGEEEELGAE (77-94)
                AR      AGALYSG (72-78)
DP00616         LCR     [sequence and positions illegible in source]
                AR      LVLLFLGA (6-13)
DP00628         LCR     LRELSELSLLSL-- (235-246)
                AR      LLSLYG (243-248)
DP00632         LCR     YSTYSQAAAQQGYSAYTAQ (6-24)
                AR      GYSAYTA- (17-23)
                LCR     SYTQAQTTATYGQTAYATSYGQPPTGYTTPTAPQA (51-85)
                AR      TDVSYTQAQTTATYGQTAYATSYG (48-71)
                LCR     QPVTAPPSYPPTSYSSTQPTSYDQSSYSQQNTYG...QSS (182-266)
                AR      QQNTYG (210-215)
DP00633         LCR     -LQAYQQRLLQQQ (2257-2268)
                AR      SLQAYQ (2256-2261)
DP00641         LCR     AALLWLLLIAAA-- (5-16)
                AR      AALLWLLLIAAAFS (5-18)
DP00666         LCR     IILLLLVLLIL-- (1130-1140)
                AR      LVLLILCF (1135-1142)
DP00670         LCR     AVAAAAIFVIIIF- (314-326)
                AR      --AAAAIFVIIIFY (316-327)
DP00706         LCR     GKGDSSGFSSYSGSSSSGSSISSARSSGGGSSG...AGS (58-105)
                AR      GFSSYS (64-69)
                LCR     GYSQVSYSSGSGSSLQGASGSSQLGSSSSHSGNSGS...GSA (111-175)
                AR      --SQVSYSS (113-119)


Length and sequence positions are given in the parentheses. Single letter codes are used to represent individual aa residues. Overlapping regions are aligned. Only the
proteins with AR/LCR overlapping regions are shown.
doi:10.1371/journal.pone.0089781.t005



with less structural disorder or in structured proteins. A similar
observation was also made by Linding et al. [75]. These proteins
contained fewer LCRs, which were composed of fewer
hydrophobic amino acids. LCRs thus may have a
significant role in the protein aggregation process and amyloid
formation. An AR may be exposed to start the aggregation process,
and LCRs could have a certain role in the process. However,
a large number of LCRs, along with a high content of polar amino
acids and attenuated hydrophobicity, may not allow the protein to
misfold/fold further to gain a β-sheet rich amyloid aggregate in
largely disordered proteins [3]. Therefore, the content of AR and
LCR and the unique balance between the two regions are
crucial for protein stability (for disordered proteins) and amyloid
formation. A proper solution condition may be needed, based on
the content of AR/LCR, to unfold the region of structured proteins
partially or fully to trigger amyloid fiber formation [76]. Nature
may have designed the disordered proteins with a unique balance
of AR and LCR sequences to provide stability and the ability to
perform multiple functions. However, an external disturbance or
a change in internal cellular conditions may break this
balance and could enhance protein aggregation and amyloid
formation.
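
The overlap between an LCR and an AR, as listed with sequence positions in Table 5, can be checked with a few lines of code. The following is only a minimal sketch: the function names `overlap` and `overlap_length` are hypothetical and are not part of any tool used in this study, and the positions are treated as 1-based, inclusive ranges as they appear in the table.

```python
from typing import Optional, Tuple

# 1-based, inclusive (start, end) residue positions, as printed in Table 5.
Range = Tuple[int, int]


def overlap(lcr: Range, ar: Range) -> Optional[Range]:
    """Return the residue range shared by two inclusive ranges, or None."""
    start = max(lcr[0], ar[0])
    end = min(lcr[1], ar[1])
    return (start, end) if start <= end else None


def overlap_length(lcr: Range, ar: Range) -> int:
    """Number of residues shared by the two ranges (0 if they are disjoint)."""
    shared = overlap(lcr, ar)
    return 0 if shared is None else shared[1] - shared[0] + 1


# Two rows taken from Table 5.
print(overlap((10, 22), (12, 22)), overlap_length((10, 22), (12, 22)))          # DP00551: (12, 22) 11
print(overlap((149, 165), (165, 170)), overlap_length((149, 165), (165, 170)))  # DP00314: (165, 165) 1
```

The second example shows that an entry can appear in Table 5 even when the LCR and the AR share a single residue.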

Most of the detected ARs in amyloidogenic proteins were six to
eight residues long. We detected a six-residue-long AR (residues 35-40)
in α-synuclein. It was significantly shorter than the aggregation-prone
segment obtained by Der-Sarkissian et al. Zhang et al.
showed four additional segments that might be involved in
α-synuclein aggregation [72]. However, the methods used did not
adequately define the characteristics of the nucleation site of amyloid
formation. Waltz allowed identification of amyloid sequences and
better distinction between them and the protein segments that
promote β-sheet rich amorphous aggregates, and that could be a
possible reason for the smaller number of ARs found in this
investigation.

Statistical analysis results and discrete analysis (Tables S1, S2,
S3, and S4, Table 6) established that the content of AR sequences
was not always proportional to the protein sequence length. It
showed a negative hyperbolic correlation between the protein
sequence length and the percentage of AR/LCR sequence
(Figure 4). The reason for this was not known. Chiti et al. observed
lower aggregation propensity in longer proteins than in
short proteins [77]. The longer proteins thus may have
evolved with attenuation (low content) of ARs to reduce unwanted
aggregation and fibril formation. It would be interesting, however,
to test whether increasing the number of ARs could enhance the
aggregation kinetics or the quality of fibril formation in longer
proteins.
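
The quantities discussed above (percentage of AR/LCR sequence, its range, mean and median, and its dependence on sequence length) can be illustrated with a short sketch. Everything here is an assumption made for illustration: the helper names are hypothetical, the hyperbolic form percent = a / length + b is one plausible reading of "negative hyperbolic correlation" rather than the fitting procedure used in this study, and the length/percentage pairs passed to the fit are made-up numbers, not data from the paper.

```python
import statistics
from typing import Dict, Sequence, Tuple


def percent_content(regions: Sequence[Tuple[int, int]], seq_len: int) -> float:
    """Percentage of residues covered by 1-based inclusive regions.

    Assumes the regions do not overlap one another.
    """
    covered = sum(end - start + 1 for start, end in regions)
    return 100.0 * covered / seq_len


def summarize(values: Sequence[float]) -> Dict[str, object]:
    """Range, mean and median of a list of percentages (cf. Table 6)."""
    return {
        "range": (min(values), max(values)),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }


def fit_hyperbola(lengths: Sequence[float], percents: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of percent = a / length + b (linear in u = 1 / length)."""
    u = [1.0 / x for x in lengths]
    u_mean = statistics.mean(u)
    y_mean = statistics.mean(percents)
    sxy = sum((ui - u_mean) * (yi - y_mean) for ui, yi in zip(u, percents))
    sxx = sum((ui - u_mean) ** 2 for ui in u)
    a = sxy / sxx
    b = y_mean - a * u_mean
    return a, b


# AR positions and sequence length of insulin as listed in Table 7.
print(round(percent_content([(36, 42), (99, 110)], 110), 2))  # 17.27

# Made-up (length, percent) pairs that follow percent = 800 / length + 5 exactly.
a, b = fit_hyperbola([100, 200, 400, 800], [13.0, 9.0, 7.0, 6.0])
print(a, b)
```

A positive fitted `a` corresponds to a percentage that falls as the sequence gets longer and levels off at `b` for very long proteins.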

In this regard, it was also important to know the conformational
preferences of AR residues. We observed that aa residues in the
ARs showed propensity towards α-helix, β-sheet/strand and coil
conformations, and not all the residues were very hydrophobic.
Waltz, used in this investigation, did not fully rely on the β-sheet
structural propensity of the residues but was built on a
position-specific scoring matrix (PSSM) and on consideration of
other physicochemical properties of the protein
sequences. It allows some tolerance towards charged and polar
residues with different hidden structural propensities. Proteins with
diverse structural domains (β-sheet, α-helix, or random coil),
including globular proteins, were found to produce aggregates with
fibrillar structure under certain solution conditions [23]; however, a
crucial structural rearrangement often occurred during conversion
of these proteins into amyloid fiber [78]. Thus slightly polar amino
acids or the presence of LCRs may play an important role in structural
reorganization.

Aggregation propensity and overall protein aggregation may
also depend on the location of the AR in the protein sequence, and
on how the ARs are surrounded by a local excess of polar/charged
amino acids or LCRs. Kar et al. recently showed that addition of a
polyproline sequence to the C-terminal side of polyGln slowed
aggregation of the peptide [48]. However, insertion of the same
residues at the N-terminal side of polyGln had very little effect
on overall aggregation of the peptide. N-terminal residues in
Huntingtin protein situated adjacent to the polyGln sequence
dramatically altered the aggregation property of the peptide.
However, the position-dependent role of LCRs, rich in polar and charged



Figure 5. Content of different types of aa residues present in the LCR, AR and total proteins. The panel compares the percentage of
individual aa residues in the LCR (Series 1, blue), AR (Series 2, red), and total protein (Series 3, green). The X-axis starts with the most abundant residues
in the AR. The amino acid residues are presented with single-letter codes along the bottom axis.
doi:10.1371/journal.pone.0089781.g005



residues, on aggregation propelled by ARs was not known with
certainty. According to the amyloid stretch hypothesis, AR-containing
proteins need to be locally/partially unfolded
to initiate and promote the process of amyloid fiber formation
[35]. Thus the presence of an LCR in a protein with less disorder may
significantly alter the amyloid formation kinetics.

The IDPs play a vital role in the molecular recognition process, and
the interaction has been found to lead to the formation of structured protein
complexes. A model of molecular recognition features or elements
(MoRFs) has been proposed to define this interaction and the
reorganization processes [79-82]. The MoRF model recognizes,
in a disordered protein sequence, a linear region that undergoes a
disorder-to-order transition upon binding to its partner. These
regions are often referred to as MoRFs. The regions could attain
α-helices, form β-strands (β-MoRFs), irregular structures (ι-MoRFs),
or a combination of all these structural elements upon binding to
their partner. However, our analysis was largely directed at finding the
amyloid-forming regions and the regions of protein sequences that
are sequentially less complex. Both the AR and LCR could be part
of MoRFs and may be involved in the molecular reorganization
process. However, further analysis may be needed to address this
issue.



One of the significant observations was that the AR sequences
were highly complex. Our analysis of IDPs showed that ~20% of the
sequence was in LCRs, and the value was close to the overall
predicted value for the SWISS-PROT database [41]. However, most
(greater than 97%, Table 2) of the AR sequences were not within
the LCRs. This indicated the complexity of the AR sequences
and confirmed the presence of fewer biased aa residues in
the ARs. Some LCRs, in which one or more aa residues form stretches
of a single amino acid, produce homopolymeric structures [41],
[49], [40], [83] and become amyloidogenic [84]. However, we
could detect no such LCR in IDPs that was both polymeric in nature
and amyloidogenic. Many prion proteins, e.g., mammalian PrP and the
yeast prions Ure2p and Sup35, contain disordered stretches that
also form β-sheet rich aggregates. These aggregation-prone
domains are also found to contain segments with low sequence
complexity and are often enriched with Gln/Asn [85-88]. Thus
prion proteins also contained both ARs and LCRs. A test was
performed with prion protein (P04156) and Huntingtin (P42858);
however, the Waltz method could detect the palindromic region
(residues 112-119) in P04156 and the polyQ region in Huntingtin
(P42858) only when 'custom' was used as the threshold in the
analysis [56]. In our analysis, 'best overall performance' was used
as the threshold, and it missed the detection of the above two

Figure 6. Comparison of the conformational preferences of residues in the ARs with that of total protein. A 3D plot shows the
percentage of residues with conformational preference for α-helix (green), β-strand/sheet (red) and coil (blue) for total proteins and their ARs, as
represented on the X-axis. The lower panel shows the 2D plot of the above data along with the error limits.
doi:10.1371/journal.pone.0089781.g006



amyloidogenic regions. We also analysed the content of ARs and
LCRs in a group of proteins whose amyloidogenicity
has been experimentally proven [56].
The list of the proteins and the analysis results are shown in
Table 7. It includes proteins such as insulin, prion protein (P04156) and
yeast protein Sup35 (P05453). The observation was that the
sequence overlap between the ARs and LCRs was again very small
(Table 7). This indicated that the ARs are compositionally highly
complex. As such, the sequence complexity and structural
heterogeneity of the AR sequences was a vital observation. Also,
the few percent of residues that overlapped with the LCR showed mixed
structural propensity. The C-terminal LCR in DP00069 that
overlapped with the AR contained seven Ile (not in a continuous stretch), and
these residues showed a preference for α-helical conformation. The
overlapping sequences of AR and LCR in DP00332, however,
showed propensity towards random coil structure. Being a part of



Table 6. Discrete analysis.

                   AR (%)                         LCR (%)
Protein type       Range        Mean   Median     Range        Mean    Median
DisProt human      0.43-31.50   8.36   6.98       1.41-91.94   15.86   10.21
DisProt nonhuman   1.20-44.00   9.27   7.50       1.30-96.80   16.80   12.20
IDEAL human        0.69-22.37   6.56   5.93       1.09-70.80   13.74   10.93
IDEAL nonhuman     1.08-17.53   7.03   6.69       1.67-70.67   13.15   8.14

Range, Mean, Median and Mode of AR and LCR sequence percentage in different groups of proteins.
doi:10.1371/journal.pone.0089781.t006

Table 7. Content of ARs and LCRs in a group of known amyloidogenic proteins.

Name                                UniProt ID  Length  LCR       LCR (%)  AR        AR (%)  Overlapping sequences
Insulin                             P01308      110     2-24      20.91    36-42     17.30
                                                                           99-110
Apolipoprotein A1                   P02647      267                        8-15      3.00
Cold shock protein cspB             P32081      67                         14-20     8.20
                                                                           26-34
                                                                           47-52
Acylphosphatase2                    P14621      99
Immunoglobulin G-binding protein G  P06654      448     69-114    24.55
                                                        241-253
                                                        379-413
                                                        427-442
Alpha-synuclein                     P37840-1    140     10-23              35-40
                                                        63-78
PI3-kinase alpha                    P27986      724     79-102    7.18     72-78     6.40
                                                        303-314            263-269
                                                        533-548            290-296
                                                                           331-336
                                                                           401-406
                                                                           483-495
Microtubule-associated protein Tau  P10636      441                        274-279   1.36
Cystatin-C                          P01034      146     2-33      21.92    10-20     22.60   10-20
                                                                           56-61
                                                                           84-92
                                                                           124-130
Ig kappa chain V-I region Rei       P01607      108                        32-37     20.40
                                                                           45-53
                                                                           71-77
Lysozyme C                          P00698      147                        52-62     11.60
                                                                           142-147
Major prion protein PrP             P04156      253     50-94     38.74    8-17      19.40   240-252
                                                        113-135            171-176
                                                        188-201            178-185
                                                        237-252            222-227
                                                                           231-235
                                                                           240-253
Sup35                               P05453      685     5-64      27.88    9-18      20.00   9-18
                                                        68-113             31-36             31-36
                                                        130-142            45-56             45-56
                                                        164-209            69-74             69-74
                                                        241-253            102-108           102-108
                                                        398-410            260-266
                                                                           278-285
                                                                           304-313
                                                                           426-445
                                                                           471-476
                                                                           527-538
                                                                           566-571
                                                                           584-596

Proteins were selected from reference [56].
doi:10.1371/journal.pone.0089781.t007

an AR, both overlapping regions were expected to induce
aggregation under certain solution conditions. However, the LCR
component may modulate the aggregation process in a different way,
and the content may change depending on the solution
condition [89]. Future experiments, starting with these overlapping
ARs and LCRs, would enhance our understanding of
how a sequence region composed of AR with low complexity
sequences would modulate the protein aggregation process that
leads to the eventual formation of amyloid fiber.
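
The notion of sequence complexity used throughout this discussion can be made concrete with a simple compositional measure. The sketch below uses the Shannon entropy of the residue composition as a stand-in; this is an assumption made for illustration only and is not necessarily the complexity measure or the software applied in this study. The function name `shannon_entropy` is hypothetical, and the two example sequences are taken from Table 5.

```python
import math
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Compositional entropy in bits per residue.

    0 means a single residue type (homopolymer); higher values mean a
    more even mix of residue types, i.e. higher compositional complexity.
    """
    counts = Counter(seq)
    n = len(seq)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())


lcr = "LLLLLFL"  # LCR of DP00191 (residues 8-14) in Table 5
ar = "AQFIISQ"   # AR of DP00231 (residues 204-210) in Table 5

# The leucine-dominated LCR scores lower than the mixed-composition AR.
print(shannon_entropy(lcr) < shannon_entropy(ar))  # True
```

A whole-sequence entropy ignores residue order and window size, so it only gives a rough sense of why a biased stretch scores as low complexity while a typical AR does not.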

Conclusion 

The current investigation was focused on the sequence complexity
and content of ARs present in proteins that were partially or fully
disordered. The study observed a very high sequence complexity
of the ARs, and these regions did not commonly overlap with the
LCRs, which were abundant in the protein sequences. Future
investigations may examine experimentally whether a unique
balance between the content of AR and LCR could provide
suitable stability for a monomeric disordered protein to remain in
solution. It would be interesting to examine how the spacing
of LCRs and ARs, and the swapping of AR positions, influence the
energetics of amyloid fiber formation. This will enhance our
understanding of why some proteins favor aggregation in a certain
environment and may add more information about the mechanism
of amyloid formation, which is linked to several pathological
human disorders.

Supporting information 

Text S1 Stable distribution function. Details of the
statistical distribution function applied to the AR/LCR length/
content distribution.
(DOCX)

Table S1 DisProt human proteins. Protein name, database
IDs and AR/LCR content measured by IUPred are listed. The last
two columns in the tables display the number of ARs found within